

## A Time Efficient Comprehensive Model of Approximate Multipliers for Design Space Exploration

**ARITH 2024** 

Ziying Cui, Ke Chen, Bi Wu, Chenggang Yan, Yu Gong and Weiqiang Liu

**College of Integrated Circuits** 

Nanjing University of Aeronautics and Astronautics



#### Proposed Comprehensive Model

Evaluation and Analysis

Verification

Content

Conclusion

## Introduction



## Introduction

Intrinsic fault tolerance

human perception / noise floor





Trade-off performance, power, and error Multipliers

**Approximate Computing** 

## Fast and accurate model for Design Space Exploration

Adders, MAC,

Approximate

## **Motivation**



#### Proposed Comprehensive Model

Evaluation and Analysis

Verification

Content

Conclusion

## **Proposed Comprehensive Model**

> The model is mainly applied to the deliberately designed approximate multiplier.

#### **Approximate Approaches**





7

## **Proposed Error Model**

> The model can generate error metrics including



> An explicit model for uniform input distributions is provided, offering enhanced computational speed.

Tip1: When *a* and *b* is mutual independence, we have

 $\sum_{i=0}^{m} \sum_{j=0}^{n} a_{i} b_{j} = \sum_{i=0}^{m} a_{i} \sum_{j=0}^{n} b_{j}$ 

## **Proposed Error Model**

> A new paradigm is needed for approximate methods concerning two inputs.

Tip2: Utilizing the one-to-many relationships between relevant partial products and input combinations.

- Only the high n bits of A<sub>r</sub> are relevant to the first n rows of partial products.
- > When the *n* LSBs of  $B_r$  are all zero, it cuts the connection to the high *n* bits of  $A_r$ .
- ➤ The ED is pre-stored in a Look-up Table.





## **Proposed Error Model**

> Take *MED* of an *N*-bit multiplier using it-bit input-truncation for example.

$$A/B \longrightarrow A_{21}2^{it} + A_{0}$$

$$ED = E - A_{21}2^{it} \times B_{21}2^{it} = 2^{it} \times (A_{21}B_{0} + B_{21}A_{0}) + A_{0}B_{0}$$

$$MED' = \sum_{a_{21}=-Ed_{it}}^{Ed_{it}-1} \sum_{b_{21}=-Ed_{it}}^{Ed_{it}-1} \sum_{a_{0}=0}^{2^{it}-1} \sum_{b_{0}=0}^{2^{it}-1} a_{0}b_{0} \times p_{a}p_{b}$$

$$MED' = \sum_{a_{0}=0}^{Ed_{it}-1} a_{0} \times p_{a_{0}} \sum_{b_{0}=0}^{2^{it}-1} b_{0} \times p_{b_{0}}$$

$$ED_{it} = 2^{N-it-1} \qquad p_{a_{0}} = P(A_{0} = a_{0}) = \sum_{a_{21}=-Ed_{it}}^{Ed_{it}-1} P(A = a_{21}2^{it} + a_{0})$$

## **Proposed Hardware Model**

## **Partition Counting**

It can be expanded to different technology, approximate Booth encoding algorithm, and approximate compressor.

Area/Delay : multiply total/maximum number by corresponding area/delay.

$$A_{com}/T_{com} = C_{ppg}A_{ppg}/T_{ppg} + C_{ppc}A_c/T_{c-s} + C_aA_a/T_{a-c}$$

Power: curve fitting based on area due to strong correlation.



Taking array multiplier as an example:

 $P_{com} = 0.0001768A_{com} - 0.004179$ 

#### Proposed Comprehensive Model

Evaluation and Analysis

Verification

Content

Conclusion

## **Evaluation and Analysis: Error Model Precision**

Table I: The Accuracy and Runtime Compared with Exhaustive Simulation of MED, MRED, MAED and RMS<sub>ed</sub>.

| ]            | lasut                                                                                                 | MED                                                                                                                  |                   | MRED         |            |                          | MAED       |             |                   | RMS <sub>ed</sub> |             |                          | ]        |             |
|--------------|-------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|-------------------|--------------|------------|--------------------------|------------|-------------|-------------------|-------------------|-------------|--------------------------|----------|-------------|
|              | Input                                                                                                 | Diff/                                                                                                                | ex_time/          | runtime      | Diff/      | ex_time/                 | runtime    | Diff/       | ex_time/          | runtime           | Diff/       | ex_time/                 | runtime  |             |
|              | distribution                                                                                          | $10^{-13}$                                                                                                           | m_time(s)         | ratio        | $10^{-15}$ | m_time(s)                | ratio      | $10^{-13}$  | m_time(s)         | ratio             | $10^{-13}$  | m_time(s)                | ratio    |             |
|              | 8-bit array multiplier with 2-bit input-truncation, 3-bit partial-product-truncation and compensation |                                                                                                                      |                   |              |            |                          |            |             |                   |                   |             |                          |          |             |
|              | uniform                                                                                               | 0                                                                                                                    | 1.40 /<br>0.0012  | 1166.67      | 0          | 1.48 /<br>0.0031         | 477.42     | 0           | 1.47 /<br>0.088   | 16.7              | 0           | 1.45 /<br>0.0016         | 906.25   |             |
| Gaussian     | $\mu = 0 \ \sigma = 100$                                                                              | 4.09                                                                                                                 | 1.51 /<br>0.0047  | 321.28       | -0.059     | 1.57 /<br>0.0075         | 209.33     | -3.41       | 1.61 /<br>0.34    | 4.74              | -6.82       | 1.56 /<br>0.016          | 97.5     |             |
| Distribution | $\mu = 50$<br>$\sigma = 200$                                                                          | 8.6                                                                                                                  | 1.50 /<br>0.0047  | 319.16       | -0.247     | 1.60 /<br>0.0075         | 213.33     | 7.67        | 1.56 /<br>0.32    | 4.88              | 0.568       | 1.55 /<br>0.017          | 91.18    |             |
|              | 10-bit radix_                                                                                         | 4 booth m                                                                                                            | ultiplier wit     | th 3-bit inp | out-trunca | tion and co              | ompensatio | on, 2-bit p | artial-prod       | uct-trunca        | ition and c | compensati               |          |             |
|              | uniform                                                                                               | 0                                                                                                                    | 45.16 /<br>0.0021 | 21504.76     | 0          | 46.14 /<br>0.0077        | 5992.21    | 0           | 45.21 /<br>0.66   | 68.5              | 0           | 44.89 /<br>0.0029        | 15479.31 |             |
|              | $\mu = 0$<br>$\sigma = 100$                                                                           | 3.77                                                                                                                 | 27.34 /<br>0.008  | 3417.5       | 26         | 47.84 /<br>0.025         | 1913.6     | 23.3        | 46.88 /<br>3.77   | 12.44             | 3.41        | 47.54 /<br>0.038         | 1251.05  |             |
|              | $\mu = 50$<br>$\sigma = 200$                                                                          | 2.82                                                                                                                 | 46.64 /<br>0.0087 | 5360.92      | 1.11       | 47.72 /<br><u>0.02</u> 5 | 1908.8     | -235        | 47.92 /<br>3.83   | 12.51             | -5.68       | 49.69 /<br><u>0</u> .036 | 1380.28  |             |
|              | 12-                                                                                                   | 12-bit radix_8 booth multiplier with 1-bit input-truncation and compensation, 4-bit partial-product-truncation Multi |                   |              |            |                          |            |             |                   |                   | Multiplier  |                          |          |             |
|              | uniform                                                                                               | 0                                                                                                                    | 377.84 /<br>0.018 | 20991.11     | -141       | 388.10 /<br>0.087        | 4460.92    | 0           | 402.45 /<br>1.33  | 302.59            | 0           | 390.2 /<br>0.021         | 18580.95 | descriptior |
|              | $\mu = 0 \ \sigma = 100$                                                                              | 24.1                                                                                                                 | 411.9 /<br>0.13   | 3168.46      | -241       | 421.95 /<br>0.35         | 1205.57    | 1550        | 414.22 /<br>17.15 | 24.15             | 0.426       | 414.63 /<br>0.58         | 714.88   |             |
|              | $\mu = 50$<br>$\sigma = 200$                                                                          | 75                                                                                                                   | 407.36 / 0.13     | 3133.54      | -17.3      | 419.75 /<br>0.33         | 1354.03    | 12000       | 426.53 /<br>17.28 | 24.68             | -1.71       | 421.83 /<br>0.58         | 727.29   |             |

## **Evaluation and Analysis: Hardware Model Precision**



Fig.2: Hardware evaluation using 25 representative multipliers for area, power consumption and PDP.

|       | MRED    | WC_RED  |
|-------|---------|---------|
| Area  | -0.055‰ | -0.726‰ |
| Power | -0.28‰  | -1.90%  |
| Delay | -3.18%  | -10%    |
| PDP   | 3.2%    | -10.5%  |

- Area model has a high precision owe to accurate partition counting and almost constant cell area.
- Power model is slightly worse than that of the area.
- Delay model is affected by varying delays among the same elements.

## **Evaluation and Analysis: Error Model Performance**



<sup>1</sup>Encoding scheme, input-truncation bit-width, partial-product-truncation bit-width, compensation for input-truncation and compensation for partial-product-truncation

**ARITH 2024** 

#### 15

## **Evaluation and Analysis: Error Model Performance**



Fig.3: The time ratio of the Monte Carlo simulation method to the proposed method and the errors for four metrics

Proposed Comprehensive Model

Content

Evaluation and Analysis

Verification

Conclusion

## **Verification: Convolution**

#### > 20 representable multipliers form a 794 billion design space.



Comparable Pareto-optimal sets with 10 times faster

Fig.4: The Pareto-optimal values using the proposed model and Monte Carlo method.

[9] D. Sengupta, F. S. Snigdha, J. Hu, and S. S. Sapatnekar, "SABER: Selection of Approximate Bits for the Design of Error Tolerant Circuits," in 54th Annual Design Automation Conference, 2017, pp. 1–6.
 [10] J. Castro-Godnez, J. Mateus-Vargas, M. Shafique, and J. Henkel, "AxHLS: Design space exploration and high-level synthesis of approximate accelerators using approximate functional units and analytical models," in Proceedings of the 39th International Conference on Computer-Aided Design, 2020, pp. 1–9.
 [11] M. Awais, H. G. Mohammadi and M. Platzner, "An MCTS-based Framework for Synthesis of Approximate Circuits," IFIP/IEEE International Conference (VLSI-SoC), Verona, Italy, 2018, pp. 219-224.

## **Verification: Convolution**

> Select the optimal values under four accuracy constraints for two objects.

|   | MSE_set<br>/10 <sup>7</sup> | MSE_ex<br>/10 <sup>7</sup> | PSNR_ex<br>/dB        | Synthesis | Analytical<br>Model |  |
|---|-----------------------------|----------------------------|-----------------------|-----------|---------------------|--|
|   |                             | PDP-optimal                | PDP/ $\mu w \cdot ns$ |           |                     |  |
|   | 0                           | 0                          | +∞                    | 856.96    | /                   |  |
| I | 1                           | 0.98                       | 33.45                 | 176.02    | 173.05              |  |
|   | 2                           | 2.01                       | 30.34                 | 131.06    | 131.00              |  |
|   | 6                           | 5.98                       | 25.61                 | 74.28     | 73.02               |  |
|   | 10                          | 9.98                       | 23.38                 | 50.50     | 52.94               |  |
|   |                             | Area-optimal               | Area/ $\mu m^2$       |           |                     |  |
|   | 0                           | 0                          | +∞                    | 2158.76   | /                   |  |
| I | 1                           | 0.98                       | 33.45                 | 604.93    | 604.95              |  |
|   | 2                           | 2.00                       | 30.36                 | 475.27    | 474.61              |  |
|   | 6                           | 5.98                       | 25.61                 | 287.28    | 288.46              |  |
|   | 10                          | 9.96                       | 23.39                 | 225.41    | 226.20              |  |

MRED of model error is **2.3‰** and **2.1%** for PDP and Area.

**79.46%** reduction of PDP is achieved when the MSE is constrained to 10^7.

Area can be reduced by **71.98%** when the optimal area is required

## **Verification: Gaussian Blur**

25 8-bit grayscale maps for input distribution

20dB and 30dB restriction







 $Area = 951.05 \mu m^2$  $PDP = 294.98 \mu W \cdot ns PP$ 

PDP, 20.37dBPDP, 30.00dBPDP =  $34.32\mu W \cdot ns$ PDP =  $92.66\mu W \cdot ns$ 

# **68.59%** less PDP for 30dB restriction

# **56.21%** less area for 30dB restriction





Area, 20.37 dB $Area = 221.89 \mu m^2$ 

Area, 30.04dB $Area = 416.43\mu m^2$ 

Fig.5: Resulting images of Gaussian blur for different targets.

#### Proposed Comprehensive Model

Evaluation and Analysis

## Verification

Conclusion

## Content

## Conclusions

- In proposed comprehensive model, five distinct quality metrics in addition to four hardware metrics are derived.
- In comparison to the Monte Carlo simulation method, the proposed model demonstrates a remarkable reduction in runtime, with an average decrease of 120.85, and in specific instances, as low as 2,500.
- The 3\*3 convolution circuit and Gaussian Blur application is employed to verify the proposed model with 79.46% reduction in PDP and a 71.98% reduction in area when compared to the accurate counterpart.

# Thanks!